Exploration of Red Wine Quality by Kyungwon Chun

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed acidity is positively skewed. Half of the wines have a fixed acidity between 7.10 and 9.20 (the interquartile range).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile acidity shows a bimodal distribution with positive skewness. Half of the wines have a volatile acidity between 0.39 and 0.64.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.285

Total acidity is composed of fixed and volatile acidity. The distribution of total acidity is positively skewed, with a median of 8.445.
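The derived variable can be sketched as follows. This is a minimal illustration that assumes total acidity is the plain sum of the two components; the report does not show the exact formula. The ten rows are the first values of each column from the str() output above, so the summary it prints will differ from the full-data summary.

```r
# First ten rows of two columns, copied from the str() output above.
wqr <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# Assumption: total acidity = fixed acidity + volatile acidity.
wqr$total.acidity <- wqr$fixed.acidity + wqr$volatile.acidity

# Same six-number summary format as the output shown in this section.
summary(wqr$total.acidity)
```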

The residual sugar distribution is concentrated at low values and has a long right tail.

The chlorides distribution is likewise concentrated at low values with a long right tail.

The total sulfur dioxide has some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Most of the wines have a density between 0.9956 and 0.9978.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Most of the wines have pH between 3.210 and 3.400.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Most of the wines have 5 or 6 in quality.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies each wine, and quality represents how good the wine is. Strictly speaking, X is an unordered factor and quality is an ordered factor, but I treated both as numerical variables for convenience. All other features represent chemical properties of the wine.

Other observations:

  • Wines with quality 5 or 6 are most common.
  • The median wine quality is 6.
  • Most wines have a quality of 5 or better.
  • About 75% of wines have a quality of 6 or worse.
  • The worst and best quality in the data set are 3 and 8, respectively.
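As a small sketch of the alternative to treating quality as a number, the code below stores it as an ordered factor and tabulates it. The ten values are the first quality ratings from the str() output, and the levels 3 to 8 are taken from the minimum and maximum in the summary; the variable names are only illustrative.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordered factor covering the observed range of ratings (3 to 8).
quality.ord <- factor(quality, levels = 3:8, ordered = TRUE)

# Frequency of each rating, including ratings absent from this sample.
counts <- table(quality.ord)
counts

# The numeric version is still needed for summaries such as the median.
median(quality)
```

On the full data the same calls would produce the counts behind the observations listed above; this ten-row sample is only meant to show the mechanics.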

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting wine quality. I suspect that some combination of the other variables can be used to build a predictive model for wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density determine those characteristics. I expect these variables to be the ones most closely related to wine quality.

Did you create any new variables from existing variables in the dataset?

I created a variable for the total acidity using the volatile and the fixed acids.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Volatile acidity shows a bimodal distribution.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
## total.acidity                 0.99            -0.16        0.63
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
## total.acidity                  0.12      0.10               -0.16
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
## total.acidity                       -0.11    0.68 -0.67      0.16   -0.08
##                      quality total.acidity
## fixed.acidity           0.12          0.99
## volatile.acidity       -0.39         -0.16
## citric.acid             0.23          0.63
## residual.sugar          0.01          0.12
## chlorides              -0.13          0.10
## free.sulfur.dioxide    -0.05         -0.16
## total.sulfur.dioxide   -0.19         -0.11
## density                -0.17          0.68
## pH                     -0.06         -0.67
## sulphates               0.25          0.16
## alcohol                 0.48         -0.08
## quality                 1.00          0.09
## total.acidity           0.09          1.00

Citric acid has a strong positive correlation with fixed acidity (0.67) and a moderate negative correlation with volatile acidity (-0.55).

The pH has a strong negative correlation with fixed acidity (-0.68) and a moderate one with citric acid (-0.54), but not with volatile acidity (0.23).

Fixed acidity and alcohol have notable positive (0.67) and negative (-0.50) correlations with density, respectively.

Most of the variables do not seem to have strong correlations with quality, but alcohol and volatile acidity have moderate positive (0.48) and negative (-0.39) correlations with quality, respectively.
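A correlation table of this kind can be produced with cor() and round(). The sketch below uses only three columns and the first ten rows from the str() output, so its numbers will not match the full-data matrix above; it is meant to show the call, not to reproduce the result.

```r
# First ten rows of three columns, copied from the str() output above.
wqr <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlations (the default method), rounded to two decimals.
cor.matrix <- round(cor(wqr), 2)
cor.matrix
```

Even in this small sample the sign of the fixed acidity and pH correlation is negative, in line with the chemistry: more acid means lower pH.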

Among the original variables, the strongest correlation in this data set is between fixed acidity and pH (-0.68). High acidity means low pH, and the graph agrees with this fact.

Citric acid is one of the main components of fixed acidity. Therefore the two variables have a strong positive correlation.

The fixed acidity has a strong positive correlation with density, too.

Yeast in wine converts citric acid to acetic acid, which makes up most of the volatile acid. Therefore, volatile acidity and citric acid are inversely related.

The citric acid has moderate negative correlations with volatile acidity and pH.

Alcohol and density also show a moderate negative correlation (-0.50).

Quality of wine tends to increase as volatile acidity decreases, because the main component of volatile acid is acetic acid which causes an unpleasant vinegar taste.

## 
## Call:
## lm(formula = quality ~ volatile.acidity, data = wqr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79071 -0.54411 -0.00687  0.47350  2.93148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.56575    0.05791  113.39   <2e-16 ***
## volatile.acidity -1.76144    0.10389  -16.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.152 
## F-statistic: 287.4 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the value of R-squared, volatile acidity explains only about 15.2% of the variance in wine quality.
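For a regression with a single predictor, R-squared equals the squared Pearson correlation, which is why the correlation of -0.39 reported earlier corresponds to an R-squared of about 0.15. The sketch below checks that identity on the first ten rows from the str() output; the fitted numbers on this small sample will differ from the full-data summary above.

```r
# First ten rows of two columns, copied from the str() output above.
wqr <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Simple linear regression with one predictor.
fit <- lm(quality ~ volatile.acidity, data = wqr)

# R-squared from the model summary and the squared correlation coefficient.
r2 <- summary(fit)$r.squared
r  <- cor(wqr$quality, wqr$volatile.acidity)

# The two values agree up to floating-point error.
c(r.squared = r2, cor.squared = r^2)
```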

## 
## Call:
## lm(formula = quality ~ I(sqrt(alcohol)), data = wqr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8551 -0.4087 -0.1711  0.5115  2.5870 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.0237     0.3538   -5.72 1.27e-08 ***
## I(sqrt(alcohol))   2.3756     0.1096   21.68  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7101 on 1597 degrees of freedom
## Multiple R-squared:  0.2274, Adjusted R-squared:  0.2269 
## F-statistic: 469.9 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the value of R-squared, the square root of alcohol explains only about 22.7% of the variance in wine quality.

Residual sugar determines the sweetness of the wine. Most of the wines maintain a certain level of sweetness.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The quality correlates with alcohol and volatile acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Citric acid is one of the main components of fixed acidity. As a result, they have a strong positive correlation.

High fixed acidity causes low pH. Therefore, fixed acidity and citric acid are negatively correlated with pH.

Wine with more volatile acidity tends to have less citric acid.

Wine with more fixed acidity tends to be denser, whereas wine with more alcohol tends to be less dense.

What was the strongest relationship you found?

Fixed acidity is strongly and positively correlated with citric acid and with density (0.67 in both cases). Citric acid may be able to substitute for fixed acidity and density, possibly with an even better estimate of wine quality.

Multivariate Plots Section

c(cor(wqr$volatile.acidity, wqr$sulphates), 
  cor(wqr$volatile.acidity, log10(wqr$sulphates)))
## [1] -0.2609867 -0.3005487

Transforming sulphates to log10(sulphates) strengthens the correlation between sulphates and volatile acidity (from -0.261 to -0.301).

c(cor(wqr$alcohol, wqr$pH), cor(wqr$alcohol, wqr$pH^7))
## [1] 0.2056325 0.2287039

Transforming pH to pH^7 slightly increases the correlation between pH and alcohol (from 0.206 to 0.229). As shown below, this slightly improves the fit of our model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide + pH, data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide + pH + citric.acid, data = wqr)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      3.095***      2.611***      2.777***      3.005***      4.296***      4.613***  
##                            (0.175)       (0.184)       (0.196)       (0.199)       (0.204)       (0.400)       (0.461)    
##   alcohol                   0.361***      0.314***      0.309***      0.292***      0.277***      0.291***      0.295***  
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)    
##   volatile.acidity                       -1.384***     -1.221***     -1.167***     -1.142***     -1.038***     -1.115***  
##                                          (0.095)       (0.097)       (0.097)       (0.097)       (0.100)       (0.115)    
##   sulphates                                             0.679***      0.874***      0.915***      0.889***      0.899***  
##                                                        (0.101)       (0.111)       (0.110)       (0.110)       (0.110)    
##   chlorides                                                          -1.645***     -1.705***     -2.002***     -1.915***  
##                                                                      (0.394)       (0.392)       (0.398)       (0.403)    
##   total.sulfur.dioxide                                                             -0.002***     -0.002***     -0.002***  
##                                                                                    (0.001)       (0.001)       (0.001)    
##   pH                                                                                             -0.435***     -0.525***  
##                                                                                                  (0.116)       (0.133)    
##   citric.acid                                                                                                  -0.167     
##                                                                                                                (0.121)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.317         0.336         0.343         0.351         0.357         0.358     
##   adj. R-squared            0.226         0.316         0.335         0.341         0.349         0.355         0.355     
##   sigma                     0.710         0.668         0.659         0.655         0.651         0.649         0.649     
##   F                       468.267       370.379       268.912       208.125       172.683       147.427       126.712     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1621.814     -1599.384     -1590.682     -1580.383     -1573.351     -1572.389     
##   Deviance                805.870       711.796       692.105       684.612       675.850       669.931       669.126     
##   AIC                    3448.114      3251.628      3208.768      3193.364      3174.767      3162.701      3162.778     
##   BIC                    3464.245      3273.136      3235.654      3225.626      3212.407      3205.719      3211.173     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================

The first attempt at a linear model accounts for 35.7% of the variance in quality (model m6). Variables with lower significance were left out.
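The step-by-step model building can be sketched with update(), adding one predictor at a time and comparing fit statistics. This is a reduced two-model illustration on the first ten rows from the str() output, not the procedure that produced the table above; the comparison data frame and its column names are my own.

```r
# First ten rows of three columns, copied from the str() output above.
wqr <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with one predictor, then extend the formula by one term.
m1 <- lm(quality ~ alcohol, data = wqr)
m2 <- update(m1, . ~ . + volatile.acidity)

# Collect the statistics that the comparison table reports for each model.
comparison <- data.frame(
  model         = c("m1", "m2"),
  r.squared     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj.r.squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC           = c(AIC(m1), AIC(m2))
)
comparison
```

Plain R-squared can never decrease when a term is added to a nested least-squares model, so adjusted R-squared and AIC are the more informative columns when deciding whether an extra variable is worth keeping.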

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)), 
##     data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7), data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid, 
##     data = wqr)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      3.095***      3.369***      3.742***      3.998***      4.003***      4.099***  
##                            (0.175)       (0.184)       (0.184)       (0.201)       (0.208)       (0.207)       (0.212)    
##   alcohol                   0.361***      0.314***      0.303***      0.285***      0.270***      0.289***      0.295***  
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)    
##   volatile.acidity                       -1.384***     -1.156***     -1.099***     -1.076***     -0.940***     -1.043***  
##                                          (0.095)       (0.097)       (0.098)       (0.097)       (0.101)       (0.114)    
##   I(log10(sulphates))                                   1.477***      1.794***      1.843***      1.849***      1.894***  
##                                                        (0.177)       (0.190)       (0.189)       (0.188)       (0.190)    
##   chlorides                                                          -1.694***     -1.729***     -2.063***     -1.935***  
##                                                                      (0.383)       (0.380)       (0.385)       (0.390)    
##   total.sulfur.dioxide                                                             -0.002***     -0.002***     -0.002***  
##                                                                                    (0.001)       (0.001)       (0.001)    
##   I(pH^7)                                                                                        -0.000***     -0.000***  
##                                                                                                  (0.000)       (0.000)    
##   citric.acid                                                                                                  -0.228     
##                                                                                                                (0.118)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.317         0.345         0.353         0.361         0.370         0.371     
##   adj. R-squared            0.226         0.316         0.344         0.352         0.359         0.367         0.368     
##   sigma                     0.710         0.668         0.654         0.650         0.646         0.642         0.642     
##   F                       468.267       370.379       280.646       217.837       180.338       155.588       134.130     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1621.814     -1587.752     -1577.984     -1568.023     -1557.699     -1555.809     
##   Deviance                805.870       711.796       682.108       673.825       665.482       656.943       655.393     
##   AIC                    3448.114      3251.628      3185.503      3167.967      3150.046      3131.397      3129.619     
##   BIC                    3464.245      3273.136      3212.389      3200.230      3187.686      3174.414      3178.013     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================

The variables in this linear model (m6) account for 37.0% of the variance in the quality of the wine. Using log10(sulphates) and pH^7 improved the result from the 35.7% obtained without transformation.

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid, 
##     data = wqr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63753 -0.37786 -0.03801  0.44159  1.96876 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.099e+00  2.123e-01  19.308  < 2e-16 ***
## alcohol               2.948e-01  1.708e-02  17.260  < 2e-16 ***
## volatile.acidity     -1.043e+00  1.138e-01  -9.161  < 2e-16 ***
## I(log10(sulphates))   1.894e+00  1.895e-01   9.995  < 2e-16 ***
## chlorides            -1.935e+00  3.905e-01  -4.954 8.04e-07 ***
## total.sulfur.dioxide -2.207e-03  5.023e-04  -4.394 1.18e-05 ***
## I(pH^7)              -6.244e-05  1.265e-05  -4.936 8.83e-07 ***
## citric.acid          -2.281e-01  1.176e-01  -1.940   0.0525 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6418 on 1591 degrees of freedom
## Multiple R-squared:  0.3711, Adjusted R-squared:  0.3684 
## F-statistic: 134.1 on 7 and 1591 DF,  p-value: < 2.2e-16


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Transforming sulphates and pH strengthens their correlations with other variables. These transformations give a clue for building a better linear model.

Were there any interesting or surprising interactions between features?

High alcohol and low volatile acidity contents seem to produce better wines.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created several linear models. Although the fit (R-squared) improved slightly after transforming a couple of variables, the final model is still not satisfactory. This may be because the dataset contains a small number of observations. Furthermore, most of the observations are mid-quality wines, which makes it difficult for the model to predict the edge cases. A supplementary dataset with more edge cases might help predict wine quality more accurately.
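One way to check how a model behaves at the extremes is to average its residuals within each quality level. The sketch below does this for a reduced two-predictor model on the first ten rows from the str() output; it is not the final model, and the column name residual is my own. A positive mean residual for a level means the model underpredicts wines of that quality.

```r
# First ten rows of three columns, copied from the str() output above.
wqr <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Reduced model used only for illustration.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wqr)

# Residual = observed quality minus predicted quality.
wqr$residual <- residuals(fit)

# Mean residual per observed quality level.
mean.residual <- tapply(wqr$residual, wqr$quality, mean)
mean.residual
```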


Final Plots and Summary

Plot One

Description One

Alcohol percentage plays a primary role in determining the quality of wines. The higher the alcohol percentage, the better the wine quality tends to be. However, the R-squared value from our earlier linear model shows that alcohol alone explains only about 22.7% of the variance in wine quality. So alcohol is not the only factor responsible for better wine quality.

Plot Two

Description Two

Volatile acidity has a negative relationship with wine quality, though it is weaker than that of alcohol. This is probably because the main component of volatile acid is acetic acid, which causes an unpleasant vinegar taste.

Plot Three

Description Three

We can see that the model fails to predict the good and bad quality wines. This is likely because most of the observations are ‘average’ quality wines and there are too few observations in the extreme range. The model accounts for only about 37.1% of the variance in quality.


Reflection

In this data, my main struggle was to get a better fit when identifying the factors responsible for different wine qualities, especially the "Good" and the "Bad" ones. As the data was heavily concentrated around "Average" quality, my training set did not have enough data at the extremes to build a model that predicts the quality of a wine from the other variables with a smaller margin of error. So in the future, I hope to get a dataset about red wines with more complete information so that I can build my models more effectively.

Initially, when I was developing this project, I saw that some wines didn't have citric acid at all, and the rest showed an almost rectangular distribution. My first thought was that this might be bad or incomplete data. But then I researched wines further and saw that citric acid is actually added to some wines to increase the acidity. So it's plausible that some wines would not have citric acid at all, which is consistent with my findings.

The other variables showed either a positively skewed or a roughly normal distribution.

First I plotted different variables against quality to see the bivariate relationships, and then one by one I added further variables to see whether together they had any effect on quality. I saw that the factors that affected wine quality the most were alcohol percentage, sulphate, and acid concentrations.

I tried to figure out the effect of each individual acid on the overall pH of the wine. Here I found out a very peculiar phenomenon where I saw that for volatile acids, the pH was increasing with acidity which was against everything I learned in my Science classes.

But then to my utter surprise, for the first time in my life as a data analyst, I saw the legendary Simpson's Paradox at play where one lurking variable was reversing the sign of the correlation and in turn totally changing the trend in the opposite direction.
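The mechanism can be shown with a tiny toy example. The numbers below are made up purely for illustration and are not taken from the wine data; group stands in for a hypothetical lurking variable. Within each group y falls as x rises, yet the pooled correlation is positive because the groups sit at different levels.

```r
# Synthetic data (not from the wine dataset): two groups at different levels.
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x     = c(1, 2, 3, 11, 12, 13),
  y     = c(3, 2, 1, 13, 12, 11)
)

# Correlation ignoring the grouping variable.
pooled <- cor(toy$x, toy$y)

# Correlation computed separately inside each group.
within <- sapply(split(toy, toy$group), function(g) cor(g$x, g$y))

pooled   # positive
within   # negative in both groups
```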

In the final part of my analysis, I plotted multivariate plots to see if there were some interesting combinations of variables which together affected the overall quality of the wine. It was in this section I found out that density did not play a part in improving wine quality.

For future analysis, I would love to have a dataset, where apart from the wine quality, a rank is given for that particular wine by 5 different wine tasters as we know when we include the human element, our opinion changes on so many different factors. So by including the human element in my analysis, I would be able to put in that perspective and see a lot of unseen factors which might result in a better or worse wine quality. Having these factors included inside the dataset would result in a different insight altogether in my analysis.